The RoadRunner Web Data Extraction System

Authors

  • Valter Crescenzi
  • Giansalvatore Mecca
  • Paolo Merialdo
Abstract

Extracting data from HTML text files and making it available to computer applications is becoming critically important for developing several emerging e-services. This paper presents RoadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. We concentrate on data-intensive Web sites, that is, sites that deliver large amounts of data through a complex graph of linked HTML pages. The paper describes the top-level software architecture of the RoadRunner system, which has been specifically designed to automate the data extraction process. The paper is organized as follows. First, Section 2 gives an overview of the project and an intuition of its key ideas. Then, Section 3 describes the overall architecture of the RoadRunner system. Section 4 concludes the paper by discussing related work.

Similar resources

Automatic Web Information Extraction in the ROADRUNNER System

This paper presents RoadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure that publish large amounts of data. The paper describes the top-level software architecture of the RoadRunner system, and the novel res...

RoadRunner for Heterogeneous Web Pages Using Extended MinHash

The Internet presents a large amount of useful information that is usually formatted for human readers, which makes it hard to extract relevant data from diverse sources. Therefore, there is a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structure...

The ROADRUNNER Project: Towards Automatic Extraction of Web Data

ROADRUNNER is a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. Ideally, we aim to treat the data extraction process of a data-intensive Web site as a black box taking as ...

Information Extraction State of the art

In this report, we first survey the major Web data extraction tools described in the literature. We then present an overview of the ROADRUNNER system, which is the scientific basis of our project: we first describe the overall approach, then discuss the main limitations of the current implementation.

Trinity: Unsupervised Web Data Extraction Using Ternary Trees

The Internet presents a huge collection of useful information, so extracting information from web documents has become a research area for which web data extractors are used. This technique works on two or more web documents generated by the same server-side template, learns a regular expression that models the template, and then uses it to extract data from similar documents. The technique introdu...
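
The general idea described above (comparing pages produced by the same template and learning a regular expression from them) can be illustrated with a toy sketch. This is not the Trinity or RoadRunner algorithm; it simply aligns two token sequences with Python's standard `difflib` and turns each mismatch into a capture group. The function names `tokenize`, `infer_template_regex`, and `extract`, as well as the sample pages, are hypothetical and exist only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping empty ones."""
    return [t.strip() for t in re.split(r"(<[^>]+>)", html) if t.strip()]


def infer_template_regex(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Tokens that align in both pages are treated as template text;
    every region where the pages differ becomes a capture group.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            parts.append("(.*?)")  # data field: the pages disagree here
    return re.compile(r"\s*".join(parts), re.DOTALL)


def extract(template, page):
    """Return the captured data fields, or None if the page does not fit."""
    match = template.fullmatch(page.strip())
    return match.groups() if match else None


# Hypothetical sample pages sharing one template.
page_a = "<html><h1>Books</h1><b>Title:</b> Dune <b>Author:</b> Frank Herbert</html>"
page_b = "<html><h1>Books</h1><b>Title:</b> Emma <b>Author:</b> Jane Austen</html>"
page_c = "<html><h1>Books</h1><b>Title:</b> Ulysses <b>Author:</b> James Joyce</html>"

template = infer_template_regex(page_a, page_b)
print(extract(template, page_c))  # -> ('Ulysses', 'James Joyce')
```

A flat alignment like this cannot handle repeated or optional structures such as lists of varying length, which is where the published systems do their real work.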

Publication date: 2001